Filtering transcriptions of utterances

ABSTRACT

A method for facilitating mobile phone messaging, such as text messaging and instant messaging, includes receiving audio data communicated from the mobile communication device, the audio data representing an utterance that is intended to be at least a portion of the text of the message that is to be sent from the mobile phone to a recipient; transcribing the utterance to text based on the received audio data to generate a transcription; and applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users. The method may also be applied to the audio data of a voicemail, with the filtered, transcribed text being communicated to a mobile phone as, for example, an SMS text message.

I. CROSS-REFERENCE TO RELATED APPLICATION

The present application is a U.S. continuation-in-part patent application of, and claims priority under 35 U.S.C. §120 to, U.S. nonprovisional patent application Ser. No. 11/697,074, filed Apr. 5, 2007, which '074 application published as U.S. patent application publication number US 2007/0239837, and which '074 application is a nonprovisional patent application of U.S. provisional patent application Ser. No. 60/789,837, filed Apr. 5, 2006.

The present application also is a U.S. nonprovisional patent application of, and claims priority under 35 U.S.C. § 119(e) to, each of the following:

(a) U.S. provisional patent application Ser. No. 60/957,701, filed Aug. 23, 2007;
(b) U.S. provisional patent application Ser. No. 60/957,702, filed Aug. 23, 2007;
(c) U.S. provisional patent application Ser. No. 60/957,706, filed Aug. 23, 2007;
(d) U.S. provisional patent application Ser. No. 60/972,851, filed Sep. 17, 2007;
(e) U.S. provisional patent application Ser. No. 60/972,853, filed Sep. 17, 2007;
(f) U.S. provisional patent application Ser. No. 60/972,854, filed Sep. 17, 2007;
(g) U.S. provisional patent application Ser. No. 60/972,936, filed Sep. 17, 2007;
(h) U.S. provisional patent application Ser. No. 60/972,943, filed Sep. 17, 2007;
(i) U.S. provisional patent application Ser. No. 60/972,944, filed Sep. 17, 2007;
(j) U.S. provisional patent application Ser. No. 61/016,586, filed Dec. 25, 2007;
(k) U.S. provisional patent application Ser. No. 61/021,335, filed Jan. 16, 2008;
(l) U.S. provisional patent application Ser. No. 61/021,341, filed Jan. 16, 2008;
(m) U.S. provisional patent application Ser. No. 61/034,815, filed Mar. 7, 2008;
(n) U.S. provisional patent application Ser. No. 61/038,046, filed Mar. 19, 2008;
(o) U.S. provisional patent application Ser. No. 61/041,219, filed Mar. 31, 2008; and
(q) U.S. provisional patent application Ser. No. 61/091,330, filed Aug. 22, 2008.

Each of the foregoing patent applications from which priority is claimed, and any corresponding patent application publications thereof, are hereby incorporated herein by reference in their entirety. Additionally, the disclosure of provisional application 60/789,837 is contained in Appendix A attached hereto and, likewise, is incorporated herein in its entirety by reference and is intended to provide background and technical information with regard to the systems and environments of the inventions of the current provisional patent application. Similarly, the disclosure of the brochure of Appendix B is incorporated herein in its entirety by reference.

Finally, the disclosures of each of the following patent applications, and any corresponding patent application publications thereof, are incorporated herein by reference: U.S. nonprovisional patent application Ser. No. 12/______, filed Aug. 25, 2008 and titled “FACILITATING PRESENTATION BY MOBILE DEVICE OF ADDITIONAL CONTENT FOR A WORD OR PHRASE UPON UTTERANCE THEREOF,” which application is a continuation-in-part of U.S. nonprovisional patent application Ser. No. 12/197,213, filed Aug. 22, 2008; and U.S. nonprovisional patent application Ser. No. 12/197,227, filed Aug. 22, 2008.

II. COPYRIGHT STATEMENT

All of the material in this patent document is subject to copyright protection under the copyright laws of the United States and of other countries. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the governmental files or records, but otherwise reserves all copyright rights whatsoever.

III. BACKGROUND OF THE PRESENT INVENTION

Automatic Speech Recognition (“ASR”) systems convert speech into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text), by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).

As their accuracy has improved, ASR systems have become commonplace in recent years. For example, ASR systems have found wide application in customer service centers of companies. The customer service centers offer middleware and solutions for contact centers. For example, they answer and route calls to decrease costs for airlines, banks, etc. In order to accomplish this, companies such as IBM and Nuance create assets known as IVR (Interactive Voice Response) that answer the calls, then use an ASR system paired with TTS (Text-To-Speech) software to decode what the caller is saying and communicate back to him.

More recently, ASR systems have found application with regard to text messaging. Text messaging usually involves the input of a text message by a sender who presses letters and/or numbers associated with the sender's mobile phone. As recognized for example in the aforementioned, commonly-assigned U.S. patent application Ser. No. 11/697,074, it can be advantageous to make text messaging far easier for an end user by allowing the user to dictate his or her message rather than requiring the user to type it into his or her phone. In certain circumstances, such as when a user is driving a vehicle, typing a text message may not be possible and/or convenient, and may even be unsafe. On the other hand, text messages can be advantageous to a message receiver as compared to voicemail, as the receiver actually sees the message content in a written format rather than having to rely on an auditory signal.

Many other applications for speech recognition and ASR systems will be recognized as well.

Currently, the state-of-the-art speech transcription engines use statistical language models (“SLMs”) to transcribe free-form speech into text. This is in contrast to using finite grammars, which describe patterns of words that can be spoken by the user and received and processed by the ASR system. Finite grammars are much more limited in the phrases that the engine can recognize, but generally provide better accuracy. The current state of speech recognition engines allows either an SLM or a finite grammar to be active when transcribing speech from audio data, but not both at the same time.

Thus, an approach is needed where an ASR system makes use of both the SLM for returning results from the audio data, and finite grammars used to post-process the text results. An approach is also needed where custom filters are used that are configured to detect and modify words and word groups. Using this approach permits text results to be generated that can be presented to a user formatted in a way that looks more typical of how a human would have written a text message. It will be recognized that this same principle is useful in other applications of ASR engines as well.

IV. SUMMARY OF THE INVENTION

The present invention includes many aspects and features. Moreover, while many aspects and features relate to, and are described in, the context of instant messaging and SMS messaging, the present invention is not limited to use only in such contexts, as will become apparent from the following summaries and detailed descriptions of aspects, features, and one or more embodiments of the present invention. For instance, the invention is equally applicable to use in the context of voicemails and emails.

Accordingly, in a first aspect of the invention a method for facilitating mobile device messaging includes the steps of: receiving audio data communicated from the mobile communication device, the audio data representing an utterance that is intended to be at least a portion of the text of the message that is to be sent from the mobile communication device to a recipient; transcribing the utterance to text based on the received audio data to generate a transcription; applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; and communicating the filtered transcription to the recipient.

In a feature of this aspect, the mobile communication device, to which the filtered transcription is communicated, is the mobile communication device from which the audio data is received.

In a feature of this aspect, the mobile communication device, to which the filtered transcription is communicated, is a mobile communication device of the recipient of the message.

In features of this aspect, the audio data is communicated from the mobile communication device using the HTTP/HTTPS protocol and is communicated over the Internet.

In another feature of this aspect, the utterance is transcribed using a language model such as a statistical language model (“SLM”) or a Hierarchical Language Model (“HLM”).

In a feature of this aspect, a filter may include a list of predetermined words (e.g., a list of predetermined words comprising a hash table). Each predetermined word of the list is associated with another predetermined word. In this regard, the step of applying a filter to the transcribed text includes comparing words from the transcribed text to the list of words of the filter and, upon a matching word, replacing the matching word with the associated, predetermined word as specified by the filter. Furthermore, as used herein, a “word” means in preferred embodiments an alphanumeric string (whether found in a dictionary or not) as well as a phrase, i.e., a grouping of words. Moreover, the grouping of words collectively may have a meaning that may be distinct from the meaning of any individual word (an example of such a “word” is an idiom like “holy cow”).
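
The word-replacement filter described above can be illustrated with a minimal sketch. The table entries, the function name `apply_replacement_filter`, and the longest-phrase-first matching strategy are assumptions made for illustration only; the source states merely that predetermined words (including phrases) are associated with replacement words, for example in a hash table.

```python
import re

# Hypothetical replacement table (a Python dict is a hash table).
# These entries are illustrative assumptions, not values from the source.
REPLACEMENTS = {
    "holy cow": "wow",
    "see you": "cya",
    "tomorrow": "tmrw",
}


def apply_replacement_filter(text, table=REPLACEMENTS):
    """Replace each predetermined word or phrase with its associated word."""
    if not table:
        return text
    # Try longer entries first so that a phrase wins over a shorter entry
    # that begins at the same position.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    # Look up the matched text in the table, ignoring case.
    return pattern.sub(lambda m: table[m.group(1).lower()], text)
```

Because a “word” may be a phrase, the sketch matches against the whole text rather than token by token; a token-by-token lookup would miss multi-word entries such as “holy cow”.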

In another feature of this aspect, the filter that is applied comprises a finite grammar.

In another feature of this aspect, the filter that is applied comprises a software filter.

In another feature of this aspect, the method further includes the step of selecting one or more filters to apply to the transcribed text from a group of filters that may be applied to the transcribed text to generate the filtered transcription. In this respect, the selection of the one or more filters to apply may be made based on an indication that is received in conjunction with the recorded audio data received from the mobile communication device. Alternatively, the selection of the filters to apply to the transcribed text may be made based on an indication that is included within a header of the communication from the mobile communication device in which the audio data is received; or the selection of the one or more filters to apply may be made based on preferences of a user of a mobile communication device, including the user of the mobile communication device from which the audio data is received or a user of a mobile device to which the message is sent.
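
One way to picture this selection step is sketched below. The header name `X-Filters`, the registry, the two toy filters, and the rule that a header indication takes precedence over stored preferences are all hypothetical choices; the source says only that the selection may be based on an indication in a header or on user preferences.

```python
# Hypothetical registry mapping filter names to callables. The two filters
# here are toy stand-ins, not the filters described in the source.
FILTER_REGISTRY = {
    "contraction": lambda t: t.replace("do not", "don't"),
    "uppercase": lambda t: t.upper(),
}


def select_filters(headers, user_prefs, registry=FILTER_REGISTRY):
    """Pick filters from a request header if present, else from preferences."""
    names = headers.get("X-Filters")  # assumed header name
    if names is not None:
        requested = [n.strip() for n in names.split(",") if n.strip()]
    else:
        requested = list(user_prefs)
    # Silently skip names that are not registered.
    return [registry[n] for n in requested if n in registry]


def apply_filters(text, filters):
    """Apply the selected filters in order."""
    for f in filters:
        text = f(text)
    return text
```

For example, `apply_filters("i do not know", select_filters({"X-Filters": "contraction, uppercase"}, []))` runs the two toy filters in the order requested.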

In another feature, a filter may include a list of respective, predetermined operations that are performed for a predetermined word or other characteristic found in the text of the transcribed utterance. For example, a predetermined operation may include the insertion of punctuation when a certain silence threshold is reached in the utterance. Another predetermined operation may include the insertion of a targeted advertisement based on a predetermined word that is found in the transcribed text. Moreover, such targeted ad insertion may further be based on location information of the mobile communication device, which may be communicated from the mobile device and which may be determined by the mobile communication device using a GPS component of the mobile communication device.

The filter that is applied preferably includes one or more of the following types of filters: an ad filter; a caller name filter; a caller number filter; a closing filter; a contraction filter; a currency filter; a date filter; a digit filter; a digit format filter; a digit homonym filter; an engine filter; a greeting filter; a hyphenate filter; a number filter; a profanity filter; an ordinal filter; a proper noun filter; a punctuation filter; a sentence filter; a shout/scream filter; an SMS filter; a tag filter; and a time filter.

With regard to the ad filter, when the ad filter is applied to the transcribed text, an advertisement is inserted into the transcribed text based on, and in association with, predetermined keywords that are identified in the transcribed text.
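
A minimal sketch of keyword-driven ad insertion follows. The keyword table, the ad text, the bracketed suffix format, and the choice to append at most one ad are assumptions for illustration; the source does not specify where or how the advertisement is placed within the text.

```python
import re

# Hypothetical keyword-to-advertisement table; entries are illustrative only.
AD_KEYWORDS = {
    "pizza": "Hungry? Try Example Pizza.",
    "movie": "Tickets at Example Cinema.",
}


def apply_ad_filter(text, ads=AD_KEYWORDS):
    """Append the ad for the first table keyword found in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    for keyword, ad in ads.items():
        if keyword in words:
            return f"{text} [Ad: {ad}]"
    return text
```

Matching on whole words avoids triggering on substrings, so “pizzazz” would not pull in the pizza advertisement.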

In another feature, the mobile communication device is a mobile phone, such as a smartphone or similar device, including the current iPhone manufactured by Apple or the Razr line of phones manufactured by Motorola.

In another aspect, a method for facilitating mobile device messaging includes the steps of: receiving from a mobile communication device, both a destination address for sending a message to a recipient, and audio data representing an utterance that represents the text of the message that is to be sent to the recipient; transcribing the utterance to text based on the received audio data to generate a transcription; applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; and communicating to the recipient the filtered transcription as the text of the message.

In another aspect of the invention, a method for facilitating mobile device messaging includes the steps of: receiving from a mobile communication device, both a destination address for sending a message to a recipient, and audio data representing an utterance that represents the text of the message that is to be sent to the recipient; transcribing the utterance to text based on the received audio data to generate a transcription; applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; communicating the filtered transcription to the mobile communication device; presenting the filtered transcription by the mobile communication device for verifying; and sending to the recipient from the mobile communication device the filtered transcription as the text of the message.

In a feature of this aspect, the method further includes revising the filtered transcription presented by the mobile communication device for verifying. In this regard, the filtered transcription that is sent as the text of the message is a revised, filtered transcription.

In another aspect of the invention, a method for facilitating mobile device messaging includes the steps of: receiving audio data representing a voicemail that has been left for a recipient; transcribing the voicemail to text based on the received audio data to generate a transcription; applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; and communicating the filtered transcription to a mobile communication device of the recipient.

In a feature of this aspect, the filtered transcription is communicated as a text message, using the SMS protocol, to the mobile communication device of the recipient of the voicemail.

In a feature of this aspect, the filtered transcription is communicated as an instant message to the mobile communication device of the recipient of the voicemail.

In a feature of this aspect, the filtered transcription is communicated as an email to the mobile communication device of the recipient of the voicemail.

In a feature of this aspect, the filter that is applied to the transcribed text to generate the filtered transcription includes a sentence punctuation filter that inserts a sentence punctuation character into the transcribed text based on a duration of silence between two words in the recorded audio data. In this regard, a punctuation character preferably is inserted into the transcribed text when a duration of silence between two words in the recorded audio data exceeds a predetermined threshold value. For example, a comma is inserted into the transcribed text when a duration of silence between two words in the recorded audio data exceeds a first predetermined threshold value (such as 0.20 milliseconds) but does not exceed a second predetermined threshold value (such as 0.49 milliseconds), the second predetermined threshold being greater than the first predetermined threshold value. Moreover, a period then is inserted into the transcribed text when a duration of silence between two words in the recorded audio data exceeds the second predetermined threshold value, and the first letter of the word immediately following the duration of silence that exceeds the second predetermined threshold value is capitalized.
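
The two-threshold rule described above can be sketched as follows. The input representation (a list of word and trailing-silence pairs), the function name, and the decision to leave the final word unpunctuated are assumptions; the default thresholds simply mirror the example figures 0.20 and 0.49 given in the text, and the silence durations are assumed to be expressed in the same unit as the thresholds.

```python
def punctuate_by_silence(words, comma_threshold=0.20, period_threshold=0.49):
    """Insert commas and periods based on the silence following each word.

    `words` is a list of (word, silence_after) pairs; the silence after the
    final word is ignored in this sketch.
    """
    out = []
    capitalize_next = False
    last_index = len(words) - 1
    for i, (word, silence) in enumerate(words):
        if capitalize_next:
            # Capitalize the first letter of the word that follows a period.
            word = word[:1].upper() + word[1:]
            capitalize_next = False
        if i != last_index:
            if silence > period_threshold:
                word += "."
                capitalize_next = True
            elif silence > comma_threshold:
                # Exceeds the first threshold but not the second.
                word += ","
        out.append(word)
    return " ".join(out)
```

A silence exactly equal to the second threshold yields a comma rather than a period, consistent with the “exceeds ... but does not exceed” wording.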

In another feature of this aspect, the filter that is applied to the transcribed text to generate the filtered transcription includes a digit homonym filter. The digit homonym filter inserts a digit, in substitution for a word that is a homonym to the digit, when such word is found immediately in-between two digits in the transcribed text. The digit homonym filter preferably is applied after a digit filter is applied, which filter converts words into digits when determined to be appropriate.

In another feature of this aspect, the utterance is transcribed using a language model comprising a statistical language model.

In another feature of this aspect, the utterance is transcribed using a language model comprising a hierarchical language model.

In another feature of this aspect, a filter includes a list of predetermined words, including phrases and alphanumeric strings, wherein each predetermined word is associated with another predetermined word, including a predetermined phrase or a predetermined alphanumeric string. The step of applying a filter to the transcribed text in such case includes comparing words, including phrases and alphanumeric strings, from the transcribed text to the list of words of the filter and, upon a match, replacing the matching word, including a phrase or alphanumeric string, with the associated, predetermined word including a predetermined phrase or a predetermined alphanumeric string.

In another feature of this aspect, the filter that is applied comprises a finite grammar.

In another feature of this aspect, the filter that is applied comprises a software filter.

In another feature of this aspect, the method further includes the step of selecting one or more filters to apply to the transcribed text from a group of filters that may be applied to the transcribed text to generate the filtered transcription. The selection of the one or more filters to apply may be made based on an indication that is received in conjunction with the recorded audio data representing the voicemail; or may be made based on preferences of the recipient of the voicemail.

The group of filters preferably includes: a caller name filter; a caller number filter; a closing filter; a contraction filter; a currency filter; a date filter; a digit filter; a digit format filter; a digit homonym filter; an engine filter; a greeting filter; a hyphenate filter; a number filter; a profanity filter; an ordinal filter; a proper noun filter; a punctuation filter; a sentence filter; a shout/scream filter; an SMS filter; a tag filter; and a time filter.

In yet another feature of this aspect, the step of applying a filter to the transcribed text to generate a filtered transcription includes applying an ad filter, whereby an advertisement is inserted into the transcribed text based on, and in association with, predetermined keywords that are identified in the transcribed text.

In another feature, the mobile communication device comprises a mobile phone.

In another aspect of the invention, a method includes the steps of: receiving audio data communicated representing an utterance; transcribing the utterance to text based on the received audio data to generate a transcription; and applying a filter to the transcribed text to generate a filtered transcription; wherein the filter that is applied to the transcribed text to generate the filtered transcription includes a sentence punctuation filter that inserts a sentence punctuation character into the transcribed text based on a duration of silence between two words in the recorded audio data.

In a feature, a character is inserted into the transcribed text when a duration of silence between two words in the recorded audio data exceeds a predetermined threshold value.

In a feature, a comma is inserted into the transcribed text when a duration of silence between two words in the recorded audio data exceeds a first predetermined threshold value but does not exceed a second predetermined threshold value, the second predetermined threshold being greater than the first predetermined threshold value. Preferably, a period is inserted into the transcribed text when a duration of silence between two words in the recorded audio data exceeds the second predetermined threshold value, and the method further includes capitalizing the first letter of the word immediately following the duration of silence that exceeds the second predetermined threshold value.

In yet another aspect of the invention, a method includes the steps of: receiving audio data communicated representing an utterance; transcribing the utterance to text based on the received audio data to generate a transcription; and applying a filter to the transcribed text to generate a filtered transcription; wherein the filter that is applied to the transcribed text to generate the filtered transcription includes a digit homonym filter that inserts a digit, in substitution for a word that is a homonym to the digit, when such word is found immediately in-between two digits in the transcribed text.

In a feature of the invention, a digit filter is first applied to the transcribed utterance before the digit homonym filter is applied to the transcribed utterance.

In a feature of the invention, the digit homonym filter includes a list of predetermined words that are homonyms to digits. In this respect, the list of the digit homonym filter comprises a hash table. Preferably, the words “for”, “won”, “ate”, “to”, and “too” are represented in the list, and are replaced respectively by the filter with “4”, “1”, “8”, “2”, and “2”.
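
The digit homonym rule can be sketched with the five word-to-digit pairs listed above. The whitespace tokenization, the function name, and the choice to test neighbors against the original (not yet substituted) tokens are simplifying assumptions; the sketch also assumes that a digit filter has already converted number words to single-digit tokens.

```python
# Word-to-digit pairs taken from the list in the text; a dict is a hash table.
DIGIT_HOMONYMS = {"for": "4", "won": "1", "ate": "8", "to": "2", "too": "2"}


def apply_digit_homonym_filter(text, table=DIGIT_HOMONYMS):
    """Replace a homonym with its digit when it sits directly between digits."""
    tokens = text.split()
    out = list(tokens)
    # The first and last tokens cannot be in-between two digits.
    for i in range(1, len(tokens) - 1):
        replacement = table.get(tokens[i].lower())
        if replacement and tokens[i - 1].isdigit() and tokens[i + 1].isdigit():
            out[i] = replacement
    return " ".join(out)
```

The neighbor check is what keeps ordinary prose intact: “this is for you” is left alone, while a dictated number such as “5 5 5 won 2 for 3” is normalized.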

In addition to the aforementioned aspects and features of the present invention, it should be noted that the present invention further encompasses the various possible combinations and subcombinations of such aspects and features.

V. BRIEF DESCRIPTION OF THE DRAWINGS

Further aspects, features, embodiments, and advantages of the present invention will become apparent from the following detailed description with reference to the drawings, wherein:

FIG. 1 is a block diagram of a communication system in accordance with a preferred embodiment of the present invention;

FIG. 2 is a block diagram of a communication system in accordance with another preferred embodiment of the present invention;

FIG. 3 is a block diagram of an exemplary implementation of the system of FIG. 1;

FIG. 4A is a block diagram illustrating a first user making use of a portion of the communication system of FIG. 1;

FIG. 4B is a graphical depiction, on a communication device, of the transcription of the utterance of FIG. 4A;

FIG. 4C is a block diagram illustrating a second user making use of a portion of the communication system of FIG. 1;

FIG. 4D is a graphical depiction, on a receiving device, of the transcription of the utterance of FIG. 4C;

FIG. 5 is a flowchart illustrating the operation of a speech engine, for example of the ASR system of FIG. 1, in accordance with preferred embodiments of the present invention;

FIG. 6 is a log of utterances of an exemplary conversation between two users;

FIG. 7 is a log illustrating unfiltered transcriptions of utterances of the exemplary conversation of FIG. 6;

FIG. 8 is a log illustrating filtered transcriptions of utterances of the exemplary conversation of FIG. 6, shown with the indications of silence removed;

FIG. 9 is a log illustrating identification of word groupings of filtered transcriptions of utterances of the exemplary conversation of FIG. 6;

FIG. 10 is a log illustrating filtered transcriptions of utterances of the exemplary conversation of FIG. 6, shown after groups of sequential words are applied to a finite grammar to convert the plain text into a more natural format;

FIG. 11 is a log illustrating filtered transcriptions of utterances of the exemplary conversation of FIG. 6, shown after being passed through an SMS filter;

FIG. 12 is a block diagram of the system architecture of one commercial implementation;

FIG. 13 is a block diagram of a portion of FIG. 12;

FIG. 14 is a typical header section of an HTTP request from the client in the commercial implementation;

FIG. 15 illustrates exemplary protocol details for a request for a location of a login server and a subsequent response;

FIG. 16 illustrates exemplary protocol details for a login request and a subsequent response;

FIG. 17 illustrates exemplary protocol details for a submit request and a subsequent response;

FIG. 18 illustrates exemplary protocol details for a results request and a subsequent response;

FIG. 19 illustrates exemplary protocol details for an XML hierarchy returned in response to a results request;

FIG. 20 illustrates exemplary protocol details for a text to speech request and a subsequent response;

FIG. 21 illustrates exemplary protocol details for a correct request;

FIG. 22 illustrates exemplary protocol details for a ping request; and

FIG. 23 illustrates exemplary protocol details for a debug request.

VI. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As a preliminary matter, it will readily be understood by one having ordinary skill in the relevant art (“Ordinary Artisan”) that the present invention has broad utility and application. Furthermore, any embodiment discussed and identified as being “preferred” is considered to be part of a best mode contemplated for carrying out the present invention. Other embodiments also may be discussed for additional illustrative purposes in providing a full and enabling disclosure of the present invention. Moreover, many embodiments, such as adaptations, variations, modifications, and equivalent arrangements, will be implicitly disclosed by the embodiments described herein and fall within the scope of the present invention.

Accordingly, while the present invention is described herein in detail in relation to one or more embodiments, it is to be understood that this disclosure is illustrative and exemplary of the present invention, and is made merely for the purposes of providing a full and enabling disclosure of the present invention. The detailed disclosure herein of one or more embodiments is not intended, nor is to be construed, to limit the scope of patent protection afforded the present invention, which scope is to be defined by the claims and the equivalents thereof. It is not intended that the scope of patent protection afforded the present invention be defined by reading into any claim a limitation found herein that does not explicitly appear in the claim itself.

Thus, for example, any sequence(s) and/or temporal order of steps of various processes or methods that are described herein are illustrative and not restrictive. Accordingly, it should be understood that, although steps of various processes or methods may be shown and described as being in a sequence or temporal order, the steps of any such processes or methods are not limited to being carried out in any particular sequence or order, absent an indication otherwise. Indeed, the steps in such processes or methods generally may be carried out in various different sequences and orders while still falling within the scope of the present invention. Accordingly, it is intended that the scope of patent protection afforded the present invention is to be defined by the appended claims rather than the description set forth herein.

Additionally, it is important to note that each term used herein refers to that which the Ordinary Artisan would understand such term to mean based on the contextual use of such term herein. To the extent that the meaning of a term used herein, as understood by the Ordinary Artisan based on the contextual use of such term, differs in any way from any particular dictionary definition of such term, it is intended that the meaning of the term as understood by the Ordinary Artisan should prevail.

Furthermore, it is important to note that, as used herein, “a” and “an” each generally denotes “at least one,” but does not exclude a plurality unless the contextual use dictates otherwise. Thus, reference to “a picnic basket having an apple” describes “a picnic basket having at least one apple” as well as “a picnic basket having apples.” In contrast, reference to “a picnic basket having a single apple” describes “a picnic basket having only one apple.”

When used herein to join a list of items, “or” denotes “at least one of the items,” but does not exclude a plurality of items of the list. Thus, reference to “a picnic basket having cheese or crackers” describes “a picnic basket having cheese without crackers”, “a picnic basket having crackers without cheese”, and “a picnic basket having both cheese and crackers.” Finally, when used herein to join a list of items, “and” denotes “all of the items of the list.” Thus, reference to “a picnic basket having cheese and crackers” describes “a picnic basket having cheese, wherein the picnic basket further has crackers,” as well as describes “a picnic basket having crackers, wherein the picnic basket further has cheese.”

Referring now to the drawings, in which like numerals represent like components throughout the several views, the preferred embodiments of the present invention are next described. The following description of the preferred embodiment(s) is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.

FIG. 1 is a block diagram of a communication system 10 in accordance with a preferred embodiment of the present invention. As shown therein, the communication system 10 includes at least one transmitting device 12 and at least one receiving device 14, one or more network systems 16 for connecting the transmitting device 12 to the receiving device 14, and an ASR system 18, including an ASR engine. Transmitting and receiving devices 12,14 may include cell phones 21, smart phones 22, PDAs 23, tablet notebooks 24, various desktop and laptop computers 25,26,27, and the like. One or more of the devices 12,14, such as the illustrated iMac and laptop computers 25,26, may connect to the network systems 16 via wireless access point 28. The various transmitting and receiving devices 12,14 (one or both types of which being sometimes referred to herein as “client devices”) may be of any conventional design and manufacture.

FIG. 2 is a block diagram of a communication system 60 in accordance with another preferred embodiment of the present invention. This system 60 is similar to the system 10 of FIG. 1, except that the ASR system 18 of FIG. 1 has been omitted and the ASR engine has instead been incorporated into the various transmitting devices 12, including cell phones 61, smart phones 62, PDAs 63, tablet notebooks 64, various desktop and laptop computers 65,66,67, and the like.

It will be appreciated that the illustrations of FIGS. 1 and 2 areintended primarily to provide context in which the inventive features ofthe present invention may be placed. A more complete explanation of oneor more system architectures implementing such systems is providedelsewhere herein, in the incorporated applications and/or in theincorporated Appendices attached hereto. Furthermore, in the context oftext messaging, the communication systems 10,60 each preferablyincludes, inter alia, a telecommunications network. In the context ofinstant messaging, the communications systems 10,60 each preferablyincludes, inter alia, the Internet.

More particularly, and as described, for example, in the aforementionedU.S. Patent Application Pub. No. US 2007/0239837, FIG. 3 is a blockdiagram of an exemplary implementation of the system 10 of FIG. 1. Inthis implementation, the transmitting device 12 is a mobile phone, theASR system 18 is implemented in one or more backend servers 160, and theone or more network systems 16 include transceiver towers 130, one ormore mobile communication service providers 140 (operating under jointor independent control) and the Internet 150. The backend server 160 isor may be placed in communication with the mobile phone 12 via themobile communication service provider 140 and the Internet 150. Themobile phone 12 has a microphone, a speaker and a display.

A first transceiver tower 130A is positioned between the mobile phone 12 (or the user 32 of the mobile phone 12) and the mobile communication service provider 140, for receiving an audio message (V1), a text message (T3) and/or a verified text message (V/T1) from one of the mobile phone 12 and the mobile communication service provider 140 and transmitting it (V2, T4, V/T2) to the other of the mobile phone 12 and the mobile communication service provider 140. A second transceiver tower 130B is positioned between the mobile communication service provider 140 and mobile devices 170, generally defined as receiving devices 14 equipped to communicate wirelessly via mobile communication service provider 140, for receiving a verified text message (V/T3) from the mobile communication service provider 140 and transmitting it (V5 and T5) to the mobile devices 170. In at least some embodiments, the mobile devices 170 are adapted for receiving a text message converted from an audio message created in the mobile phone 12. Additionally, in at least some embodiments, the mobile devices 170 are also capable of receiving an audio message from the mobile phone 12. The mobile devices 170 include, but are not limited to, a pager, a palm PC, a mobile phone, or the like.

The system 10 also includes software, as disclosed below in more detail, installed in the mobile phone 12 and the backend server 160 for causing the mobile phone 12 and/or the backend server 160 to perform the following functions. The first step is to initialize the mobile phone 12 to establish communication between the mobile phone 12 and the backend server 160, which includes initializing a desired application from the mobile phone 12 and logging into a user account in the backend server 160 from the mobile phone 12. Then, the user 32 presses and holds one of the buttons of the mobile phone 12 and speaks an utterance, thus generating an audio message, V1. At this stage, the audio message V1 is recorded in the mobile phone 12. When the user 32 releases the button, the recorded audio message V1 is sent to the backend server 160 through the mobile communication service provider 140.

In the exemplary embodiment of the present invention as shown in FIG. 3, the recorded audio message V1 is first transmitted to the first transceiver tower 130A from the mobile phone 12. The first transceiver tower 130A outputs the audio message V1 into an audio message V2 that is, in turn, transmitted to the mobile communication service provider 140. Then the mobile communication service provider 140 outputs the audio message V2 into an audio message V3 and transmits it (V3) to the Internet 150. The Internet 150 outputs the audio message V3 into an audio message V4 and transmits it (V4) to the backend server 160. The content of all the audio messages V1-V4 is identical.

The backend server 160 then transcribes the audio message V4 to text using an SLM. The transcribed text is an unfiltered transcription which is then filtered using one or more filters. The backend server 160 determines one or more filters to apply, and an order in which to apply them, and then filters the transcription accordingly. Preferably, one or more of these filters utilizes a finite grammar to refine the unfiltered transcription. Some of these filters, however, may simply be software filters utilizing software algorithms that alter the transcribed text. Exemplary filters of both types are described in more detail hereinbelow. The output of the filter process is a filtered transcription.

The determination of the number and type of filters to be applied, as well as the order in which they are to be applied, may be informed by direct or indirect user selections. Information representing such selection(s) may be transmitted to the backend server 160 together with the audio message. Alternatively, this information may be provided in user preference settings, which may be stored on the mobile phone 12, at the mobile communication service provider 140, on the Internet 150, or at the backend server 160. As a further alternative, a user may simply indicate a type of message to be sent (such as a text message or an instant message), or a specific recipient or type of recipient (such as a work contact or a friend), and settings associated with that selection, stored in one of the above-enumerated locations, may be utilized.

While it is preferred that transcription and filtering be performed at a backend server 160, it is possible that such a backend server 160 may comprise a plurality of servers each communicating with at least one other of the plurality of servers. In this case, the transcription and filtering may occur on different servers, and filtering may even occur on a plurality of servers. It is also possible, however, that the backend server 160 consists of a single server.

After the transcription and filtering, the filtered transcription is sent as a text message, T1, and/or a digital signal, D1, back to the Internet 150, which outputs them into a text message T2 and a digital signal D2, respectively. The text message T1 and the digital signal D1 correspond to two different formats of the audio message V4.

The digital signal D2 is transmitted to a digital receiver 180, generally defined as a receiving device 14 equipped to communicate with the Internet and capable of receiving the digital signal D2. In at least some embodiments, the digital receiver 180 is adapted for receiving a digital signal converted from an audio message created in the mobile phone 12. Additionally, in at least some embodiments, the digital receiver 180 is also capable of receiving an audio message from the mobile phone 12. A conventional computer is one example of a digital receiver 180. In this context, a digital signal D2 may represent, for example, an email or instant message.

It should be understood that, depending upon the configuration of the backend server 160 and software installed on the mobile phone 12, and potentially based upon the system set up or preferences of the user 32, the digital signal D2 can either be transmitted directly from the backend server 160 or it can be provided back to the mobile phone 12 for review and acceptance by the user 32 before it is sent on to the digital receiver 180.

The text message T2 is sent to the mobile communication service provider 140 that outputs it (T2) into a text message T3. The output text message T3 is then transmitted to the first transceiver tower 130A. The first transceiver tower 130A then transmits it (T3) to the mobile phone 12 in the form of a text message T4. It is noted that the substantive content of all the text messages T1-T4 may be identical, which is the transcribed and filtered text of the audio messages V1-V4.

Upon receiving the text message T4, the user 32 verifies it and sends the verified text message V/T1 to the first transceiver tower 130A, which in turn transmits it to the mobile communication service provider 140 in the form of a verified text V/T2. The verified text V/T2 is transmitted to the second transceiver tower 130B in the form of a verified text V/T3 from the mobile communication service provider 140. Then, the transceiver tower 130B transmits the verified text V/T3 to the mobile devices 170.

In at least one implementation, the audio message is simultaneously transmitted to the backend server 160 from the mobile phone 12, when the user 32 speaks to the mobile phone 12. In this circumstance, it is preferred that no audio message is recorded in the mobile phone 12, although it is possible that an audio message could be both transmitted and recorded.

Such a system may be utilized to convert an audio message into a text message. In at least one implementation, this may be accomplished by first initializing a transmitting device so that the transmitting device is capable of communicating with a backend server 160. Second, a user 32 speaks to or into the client device so as to create a stream of an audio message. The audio message can be recorded and then transmitted to the backend server 160, or the audio message can be simultaneously transmitted to the backend server 160 through a client-server communication protocol. Streaming may be accomplished according to processes described elsewhere herein and, in particular, in FIG. 4, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837. The transmitted audio message is then transcribed and filtered at the backend server 160 as described hereinabove. The filtered transcription is then sent as a text message back to the client device 12. Upon the user's verification, the transcribed and filtered text message is forwarded to one or more recipients 34 and their respective receiving devices 14, where the text message may be displayed on the device 14. Incoming messages may be handled, for example, according to processes described elsewhere herein and, in particular, in FIG. 2, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837.

Additionally, in at least one implementation, advertising messages and/or icons may be displayed on one or both types of client devices 12,14 according to keywords contained in the transcribed text message, wherein the keywords are associated with the advertising messages and/or icons. One or more such implementations are described in more detail in one or more of the incorporated references, including U.S. patent application Ser. No. 12/197,227.

Still further, in at least one implementation, one or both types of client devices 12,14 may be located through a global positioning system (GPS); and listing locations, proximate to the position of the client device 12,14, of a target of interest may be presented in the converted text message. Additionally, filter selection and/or formatting preferences may be altered or selected based upon a determined location, as described more fully hereinbelow.

FIG. 4A is a block diagram illustrating a first user 32 making use of a portion of the communication system 10 of FIG. 1. As shown therein, a first user 32 is utilizing the system 10 to communicate with a second user 34. More particularly, the first user 32 in FIG. 4A is speaking an utterance 36 into the first device 12, which in this context may be referred to as a “transmitting device,” and the utterance is sent as recorded audio data to the ASR system 18. In FIG. 4A, the utterance 36 is “Hey, do you want to meet for coffee?” This utterance may be transmitted to the ASR system 18, which attempts to convert the speech into text by first transcribing it using a statistical language model (SLM) and then applying one or more filters. In at least some embodiments, the first user 32 and/or the second user 34 may select, via user preferences and/or directly, one or more filters to apply or not apply. Further, in at least some embodiments, the language text thus created may then be transmitted directly to the second device 14, which in this context may be referred to as a “receiving device,” without further review by the first user 32. In other embodiments, the language text may first be displayed on the first device 12 for approval by the first user 32 before being sent to the second device 14. FIG. 4B is a graphical depiction, on the first communication device 12, of a filtered transcription of the utterance 36 of FIG. 4A.

FIG. 4C is a block diagram illustrating a second user 34 making use of a portion of the communication system 10 of FIG. 1. As shown therein, the second user 34 is utilizing the system 10 to communicate with the first user 32. More particularly, the second user 34 in FIG. 4C is speaking an utterance 38 into the second device 14, which in this context may be referred to as a “transmitting device,” and the recorded speech audio is sent to the ASR system 18. In FIG. 4C, the utterance 38 is “I can meet you at twelve-thirty, but I can only stay twenty-five minutes.” This utterance may be transmitted to the ASR system 18, which attempts to convert the speech into text by first transcribing it using an SLM and then applying one or more filters. Once again, in at least some embodiments, the first user 32 and/or the second user 34 may select, via user preferences and/or directly, one or more filters to apply or not apply. Further still, in at least some embodiments, the language text thus created may then be transmitted directly to the first device 12, which in this context may be referred to as a “receiving device,” without further review by the second user 34. In other embodiments, the language text may first be displayed on the second device 14 for approval by the second user 34 before being sent to the first device 12. FIG. 4D is a graphical depiction, on the second communication device 14, of a filtered transcription of the utterance 38 of FIG. 4C.

A conversation between the two users 32,34 may continue in this fashion, with each user 32,34 speaking into his or her respective communication device 12,14, each utterance 36,38 being transcribed and filtered into a filtered transcription, and the filtered transcription being transmitted to the other device 14,12, either with or without the approval of the user 32,34 before such transmission. FIG. 6 is a log of an exemplary conversation, comprised of a series of utterances, between the two users 32,34. Notably, each utterance of FIG. 6 is displayed in a formal manner in that the utterance is shown with all words and numbers spelled out and with formal punctuation and capitalization.

FIG. 5 is a flowchart illustrating the operation of a speech engine, for example of the ASR system 18 of FIG. 1, in accordance with one or more preferred embodiments of the present invention. As shown therein, a process 700 carried out by the speech engine begins at step 705 with a recorded utterance 36,38 being received by the speech engine from a transmitting communication device 12,14. At step 710, the speech engine transcribes the utterance 36,38 using a statistical language model (SLM) to create an unfiltered transcription. FIG. 7 is a log illustrating unfiltered transcriptions of the utterances of the exemplary conversation of FIG. 6. Notably, the speech engine has injected “[silence]” tags into the unfiltered transcriptions to indicate short periods of silence in the recorded utterances 36,38.

At step 715, the speech engine determines whether one or more filters should be applied to the unfiltered transcription, and at step 720 the speech engine determines an order in which filters should be applied. These determinations may be informed by information received together with the recorded utterance and/or by user preferences, stored in one or more of the locations as described hereinabove. In the present example, it is determined that a tag filter should be applied, followed by a series of finite grammar filters, and then a software filter that reformats the text into a form containing common text messaging abbreviations.
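The selection and ordering of filters at steps 715 and 720 can be pictured as an ordered list of callable functions applied one after another. The following is only a minimal sketch under stated assumptions: the registry, the filter names "tag" and "sms", and the two toy filter bodies are hypothetical illustrations and are not taken from the described speech engine.

```python
from typing import Callable, Dict, List

# A filter is modeled here as a function from text to text (an assumption).
Filter = Callable[[str], str]


def strip_silence_tags(text: str) -> str:
    """Hypothetical tag filter: drop "[silence]" markers and tidy spacing."""
    return " ".join(text.replace("[silence]", " ").split())


def sms_shortcuts(text: str) -> str:
    """Hypothetical SMS filter covering a single phrase."""
    return text.replace("talk to you later", "ttyl")


# Hypothetical registry mapping filter names to functions.
REGISTRY: Dict[str, Filter] = {"tag": strip_silence_tags, "sms": sms_shortcuts}


def apply_filters(unfiltered: str, order: List[str]) -> str:
    """Apply the named filters in the given order; an empty list applies none."""
    text = unfiltered
    for name in order:
        text = REGISTRY[name](text)
    return text


print(apply_filters("see you soon [silence] talk to you later", ["tag", "sms"]))
```

In such a sketch, the `order` list could be populated from information received with the utterance or from stored user preferences.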

At step 725 a filter is used to eliminate, or alternatively to replace with punctuation, the injected or inserted indications of silence. FIG. 8 is a log illustrating filtered transcriptions of the recorded utterances of the exemplary conversation of FIG. 6, shown with indications of silence removed. Subsequently, another filter is used to identify sequential word groupings which qualify to be applied to a finite grammar (or finite state grammar), which is understood to have the meaning generally ascribed to such term in the field of speech recognition. FIG. 9 is a log of the exemplary conversation of FIG. 6, shown with several such word groupings identified. Several examples of such finite grammars are shown in Table 1, but it will be appreciated that any number of such finite grammars may be used without departing from the scope of the present invention. Each grouping of sequential words is then filtered using a selected finite grammar to convert the plain text into a more natural format. For example, unfiltered transcription “i only have twenty five dollars” may be scanned using a currency filter, which would determine that the words “twenty”, “five” and “dollars” make up a sequential word grouping “twenty five dollars”. A currency grammar is then applied to this sequential word grouping, and the output is used to replace the sequential word grouping, creating the filtered transcription “i only have $25”.

It will be appreciated that a single filter may implement, utilize or apply one or more finite grammars, or, preferably, a different filter may be used to implement, utilize, or apply each finite grammar. FIG. 10 is a log illustrating filtered transcriptions of the recorded utterances of the exemplary conversation of FIG. 6 after a number of filters have applied a number of finite grammars to identified groupings.

TABLE 1

Unfiltered Transcription    Finite Grammar           Filtered Transcription
twelve thirty               Date and Time Grammar    12:30
twenty five                 Numbers Grammar          25
twenty dollars              Currency Grammar         $20

Finally, the text is passed through a short message service (“SMS”) filter which converts identified words and/or word groupings to common SMS shortcuts. FIG. 11 is a log illustrating filtered transcriptions of the recorded utterances of the exemplary conversation of FIG. 6, shown after being passed through such an SMS filter.

The description above is exemplary in nature. A wide variety of filters are available to format speech engine results, only a few of which have thus far been described.

Time Filter

A first such filter is a time filter. Functionality of an exemplary time filter has been described hereinabove. A time filter can be used to format time phrases. For example, the unfiltered transcription “twelve thirty p m” could be converted to the filtered transcription “12:30 P.M.” Likewise, the unfiltered transcription “eleven o clock in the morning” could be converted to the filtered transcription “11:00 A.M.” In at least some embodiments, a user may select, either directly or via a user preferences setting, a format he or she wishes time values to be filtered to.

Currency Filter

Exemplary functionality of a second such filter, a currency filter, was also described hereinabove. A currency filter can be used to format monetary amounts. For example, the unfiltered transcription “i need to borrow one hundred dollars” could be converted to the filtered transcription “i need to borrow $100”, or, alternatively, “i need to borrow $100.00”. As with the time filter, in at least some embodiments, a user may select, either directly or indirectly, a format he or she wishes currency values to be filtered to.
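As an illustration of how a currency filter of this kind might work, the sketch below scans for a run of number words followed by “dollar” or “dollars” and replaces the grouping with a formatted amount. It is a simplified assumption-laden example, not the finite grammar itself: the word tables, the helper `words_to_int`, and the restriction to whole-dollar amounts below one million are all hypothetical choices.

```python
# Hypothetical number-word tables; a real grammar would be far more complete.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
NUMBER_WORDS = set(UNITS) | set(TENS) | {"hundred", "thousand"}


def words_to_int(words):
    """Convert a list of number words (below one million) to an integer."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current = max(current, 1) * 100
        elif w == "thousand":
            total += max(current, 1) * 1000
            current = 0
    return total + current


def currency_filter(text):
    """Replace "<number words> dollars" groupings with "$<amount>"."""
    tokens, out, i = text.split(), [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] in NUMBER_WORDS:
            j += 1
        if j > i and j < len(tokens) and tokens[j] in ("dollar", "dollars"):
            out.append("$" + str(words_to_int(tokens[i:j])))
            i = j + 1          # skip the currency word as well
        elif j > i:
            out.extend(tokens[i:j])  # number words without a currency word
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(currency_filter("i need to borrow one hundred dollars"))
```

Number words that are not followed by a currency word are left untouched, so that another filter (for example a number filter) could handle them.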

Digit, Digit Format, Number, and Ordinal Filters

A digit filter can be used to format utterances of digits. For example, the unfiltered transcription “my phone number is seven seven seven six five zero three” could be converted to the filtered transcription “my phone number is 7 7 7 6 5 0 3”. Additionally, a separate digit format filter can be used which can also format utterances of digits. A digit format filter will strip spaces from between digits and optionally insert one or more hyphens into digit strings with a length of 7, 10, or 11. The filtered transcription above could be further filtered using the digit format filter to the filtered transcription “my phone number is 777-6503”.
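The two-stage behavior described above can be sketched as a pair of small functions. This is an illustrative sketch only: the function names are hypothetical, and the hyphen positions used for 10- and 11-digit strings (3-3-4 and 1-3-3-4) are an assumption, since the text specifies only the 7-digit example.

```python
import re

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def digit_filter(text):
    """Replace each spoken digit word with its numeral.

    Note: this naive version converts every digit word, including
    ones that are not part of a digit sequence.
    """
    return " ".join(DIGIT_WORDS.get(tok, tok) for tok in text.split())


def digit_format_filter(text):
    """Strip spaces inside runs of single digits and hyphenate by length."""
    def fmt(match):
        d = match.group(0).replace(" ", "")
        if len(d) == 7:
            return f"{d[:3]}-{d[3:]}"
        if len(d) == 10:                      # assumed 3-3-4 grouping
            return f"{d[:3]}-{d[3:6]}-{d[6:]}"
        if len(d) == 11:                      # assumed 1-3-3-4 grouping
            return f"{d[0]}-{d[1:4]}-{d[4:7]}-{d[7:]}"
        return d
    # A run is two or more single digits separated by single spaces.
    return re.sub(r"\b\d(?: \d)+\b", fmt, text)


spoken = "my phone number is seven seven seven six five zero three"
print(digit_format_filter(digit_filter(spoken)))
```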

It will be appreciated that the digit filter described above may not properly handle larger numbers. To address this, a number filter may additionally be used to filter large numbers. For example, the unfiltered transcription “order five thousand widgets” could be converted using the number filter to the filtered transcription “order 5,000 widgets”.

Ordinal numbers can be treated with another filter. An ordinal number filter can be used to convert ordinal numbers, such as “first”, “sixtieth” and “thousandth”. For example, the unfiltered transcription “i finished in sixth place” could be converted to the filtered transcription “i finished in 6th place”.

Date Filter

Another filter, a date filter, can be used to format dates. For example, the unfiltered transcription “he was born on the twenty second of february in seventeen thirty two” could be converted to the filtered transcription “he was born on February 22, 1732”. Similarly, the unfiltered transcription “he was killed on march fifteenth forty four b. c.” could be converted to the filtered transcription “he was killed on March 15, 44 BC”. (These examples refer to George Washington and Julius Caesar, respectively.)

Caller Name Filter

Another filter, a caller name filter, can be used to compare each word in a transcription with each name (first, middle, last, etc.) of the originator or recipient of the message the transcription is associated with. This name is preferably extracted in the manner of caller ID, but alternatively may be extracted from an address book. For example, the unfiltered transcription “hey this is wheel call me back” could be converted to “hey this is Will call me back”. When the utterance “hey, this is Will, call me back” is transcribed by the SLM, possible alternate transcriptions, or alternate words of a transcription, may be stored in addition to an unfiltered transcription. By comparing each name of the originator and/or recipient with alternate words of a transcription, it can be determined whether one of the transcribed words or phrases should be replaced with the name of the caller or recipient.

Caller Number Filter

Similarly, a caller number filter can be used to compare each word in a transcription with a number of the originator or recipient of the message the transcription is associated with. This number is preferably extracted in the manner of caller ID, but alternatively may be extracted from an address book. For example, the unfiltered transcription “hey call me back at 8531234” that was received from Will, whose phone number is 8501234, could be converted to the filtered transcription “hey call me back at 8501234” (it is worth noting that a hyphen may further be inserted between the third and fourth digits, either by this filter, or by another filter, but such insertion has been omitted to simplify this example). It will be appreciated that this could be accomplished in any number of ways, such as, for example, comparing a plurality of digits of a string of digits of the unfiltered transcription with a plurality of digits of the caller's number.
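One of the many possible ways to compare digits, sketched here purely for illustration, is to count position-by-position mismatches between a transcribed digit string and the caller's number and to substitute the caller's number when the two are close. The function name and the `max_mismatches` tolerance are hypothetical; the text does not prescribe a particular comparison.

```python
import re


def caller_number_filter(text, caller_number, max_mismatches=1):
    """Replace a digit string that nearly matches the caller's number.

    A candidate is replaced only if it has the same length as the
    caller's number and differs in at most `max_mismatches` positions
    (a hypothetical tolerance chosen for this sketch).
    """
    def fix(match):
        candidate = match.group(0)
        if len(candidate) != len(caller_number):
            return candidate
        mismatches = sum(a != b for a, b in zip(candidate, caller_number))
        return caller_number if mismatches <= max_mismatches else candidate

    return re.sub(r"\d+", fix, text)


print(caller_number_filter("hey call me back at 8531234", "8501234"))
```

A stricter or looser tolerance changes how aggressively transcribed numbers are overwritten, so a digit string that is genuinely a different number is less likely to be replaced when the tolerance is small.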

Closing Filter

Another filter, a closing filter, can be used to replace words at the end of a recorded utterance. For example, it is typical to end a conversation with “bye” or “thanks”; however, an SLM may transcribe this speech as “by” or “tanks”. The closing filter could be applied to the unfiltered transcription “please call my secretary tanks” to produce the text “please call my secretary thanks”. Likewise, the unfiltered transcription “Call me back by” could be converted to the filtered transcription “Call me back bye”.

Greeting Filter

Similarly, a greeting filter can be used to replace words at the beginning of a recorded utterance. For example, it is typical to begin conversations with “hi” or “hey”; however, an SLM may transcribe these words as “hay”, or possibly even “weigh” or “tie”. If a word at the beginning of a transcription rhymes with a greeting word, it can be replaced with the appropriate word it rhymes with. The greeting filter could be applied to the unfiltered transcription “hay jeff this is sandy” to produce the filtered transcription “hey jeff this is sandy”.

Hyphenate Filter

A spoken letter, for example “b”, may be transcribed by an SLM in a variety of ways. One common transcription method is to transcribe an individually spoken letter as the lowercase letter followed by a period. For example, the utterance “My name is John Doe, spelled D O E” would be transcribed as “my name is john doe spelled d. o. e.” A filter may be used to render this output more easily readable. A hyphenate filter can convert the transcribed text of such single spoken letters into hyphenated letters, so that the above unfiltered transcription would become the filtered transcription “my name is john doe spelled d-o-e”.
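A hyphenate filter of this kind lends itself to a single regular-expression substitution. The sketch below is one possible approach, assuming spoken letters arrive as a lowercase letter followed by a period and separated by single spaces; the function name is hypothetical, and a lone spoken letter is deliberately left unchanged.

```python
import re

# Two or more "x." tokens in a row, not glued to a neighboring word.
SPELLED_RUN = re.compile(r"(?<![a-z])[a-z]\.(?: [a-z]\.)+(?![a-z])")


def hyphenate_filter(text):
    """Join runs of individually spoken letters with hyphens."""
    def join_letters(match):
        letters = [ch for ch in match.group(0) if ch.isalpha()]
        return "-".join(letters)
    return SPELLED_RUN.sub(join_letters, text)


print(hyphenate_filter("my name is john doe spelled d. o. e."))
```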

Contraction Filter

A contraction filter can be used to replace two or more words with a contraction of those words. For example, the unfiltered transcription “i can not do that” could be converted to the filtered transcription “i can't do that”.

Proper Noun Filter

A proper noun filter can be used to capitalize proper nouns. For example, the unfiltered transcription “go to las vegas nevada” could be converted to the filtered transcription “go to Las Vegas Nevada”, or alternatively to the filtered transcription “go to Las Vegas, Nevada”.

Obscenity Filter

An obscenity filter can be used to replace obscene words with censoring characters or text. For example, the unfiltered transcription “i just stepped in dog shit” could be converted to the filtered transcription “i just stepped in dog ####”, or alternatively, “i just stepped in dog poo”.

Sentence Punctuation Filter

A Sentence Punctuation Filter attempts to punctuate text from an ASR system based on silence duration information that is provided by the ASR system as part of the transcription.

Essentially, the transcribed text is converted into sentences by adding periods, commas, or other forms of punctuation based on silence duration information.

For example, suppose the ASR system generates the following text:

    “hi this is bob <sil 0.56> i was wondering <sil 0.23> um <sil 0.13> if you are going to the football game”

In this example, the ASR system has detected three places of silence, represented by the <sil #.##> tags. The first is 0.56 milliseconds in duration; the next is 0.23 milliseconds in duration; and the third is 0.13 milliseconds in duration. Based on this silence duration information, the filter inserts punctuation characters. Specifically, a punctuation character is inserted between text immediately preceding and following a silence duration that exceeds a predetermined threshold.

So, suppose the filter is configured to replace any silence duration of 0.50 milliseconds and above with a period and any silence duration of between 0.20 milliseconds and 0.49 milliseconds with a comma. Any silence below 0.20 milliseconds is ignored.

When the filter is applied to the text, the result is:

    “hi this is bob. I was wondering, um if you are going to the football game”

As a secondary function, this filter also capitalizes the first letter of the next word if it inserts a period into the text. This is done to maintain readability.

Formatting of the text into proper grammatical sentence structure is not necessarily accomplished by this filter. Instead, the filter simply inserts punctuation based on pause durations in speech.
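The threshold logic of the sentence punctuation filter can be sketched as follows. This is a minimal illustration, assuming the silence tags have the form `<sil N.NN>` and using the example thresholds given above as defaults; the function and parameter names are hypothetical, and the duration values are treated as plain numbers in whatever unit the ASR system reports.

```python
import re

# Matches a silence tag plus surrounding whitespace, capturing the duration.
SIL_TAG = re.compile(r"\s*<sil\s+(\d+(?:\.\d+)?)>\s*")


def sentence_punctuation_filter(text, period_at=0.50, comma_at=0.20):
    """Replace silence tags with punctuation based on their duration."""
    # re.split with one capturing group yields:
    # [text0, duration1, text1, duration2, text2, ...]
    parts = SIL_TAG.split(text)
    result = parts[0]
    for i in range(1, len(parts), 2):
        duration = float(parts[i])
        following = parts[i + 1]
        if duration >= period_at:
            # Period, then capitalize the first letter of the next word.
            result += ". " + following[:1].upper() + following[1:]
        elif duration >= comma_at:
            result += ", " + following
        else:
            result += " " + following  # short pause: tag is simply dropped
    return result


raw = ("hi this is bob <sil 0.56> i was wondering <sil 0.23> um "
       "<sil 0.13> if you are going to the football game")
print(sentence_punctuation_filter(raw))
```

Because the sketch consumes the silence tags itself, it would need to run before any filter that strips tags from the transcription.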

Shout/Scream Filter

Speech at a high volume can be characterized as a shout, and speech at an even higher volume can be characterized as a scream. Phrases transcribed by the ASR engine may contain an indication of such a high or abnormally high volume. In the event of such a high volume, a shout/scream filter may alter the transcribed text to further convey this shout or scream. The text of the transcribed phrase may be capitalized and exclamation marks appended to the phrase. For example, the phrase “it is almost midnight”, which is associated with an indication that it was spoken at a high volume, may be converted to “IT IS ALMOST MIDNIGHT!”. Likewise, the phrase “help me”, which is associated with an indication that it was spoken at an even higher volume, may be converted to “HELP ME!!!”.
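A shout/scream filter might be sketched as below. How the ASR engine conveys the volume indication is not specified in the text, so the `volume` argument and its labels "shout" and "scream" are hypothetical stand-ins for that indication.

```python
def shout_scream_filter(phrase, volume="normal"):
    """Capitalize a phrase and append exclamation marks for loud speech.

    `volume` is a hypothetical label standing in for whatever volume
    indication accompanies the transcribed phrase.
    """
    if volume == "scream":
        return phrase.upper() + "!!!"
    if volume == "shout":
        return phrase.upper() + "!"
    return phrase


print(shout_scream_filter("it is almost midnight", "shout"))
print(shout_scream_filter("help me", "scream"))
```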

Digit Homonym Filter

There are instances where the ASR system returns a word that sounds like the word that was uttered, but actually is spelled differently. The digit homonym filter is configured to address instances like this.

Such instances are most obvious when someone utters a phone number and the ASR system mistakenly returns “for” instead of “four” or “ate” instead of “eight”. This digit homonym filter is configured to replace these misrecognized words with their corresponding numeric equivalents.

For example, suppose the following unfiltered transcription is returned by the ASR system:

    “call me back at three four for five one seven eight”

The word “for” is actually supposed to be the word “four”, but the ASR system misrecognized it as “for”. Applying the digit filter generates the following text:

    “call me back at 3 4 for 5 1 7 8”

Next, applying the digit homonym filter generates the following text:

    “call me back at 3 4 4 5 1 7 8”

In particular, the filter stores a list of known digit homonym words, which include “for”, “won”, “ate”, “to”, and “too”. If a digit homonym word from the list is encountered in the transcribed text, then the filter looks at the word preceding it and the word following it to see if they are both digits and, if so, then the digit homonym filter replaces the homonym word with its numeric equivalent.

Note that the order of applying the digit filter and the digit homonym filter is important; the digit filter should be applied before the digit homonym filter.
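The neighbor check described for the digit homonym filter can be sketched in a few lines. The homonym list is the one given above; the function name is hypothetical, and the sketch assumes the digit filter has already run so that neighboring digits appear as numerals.

```python
# Homonym list taken from the description; values are the numeric equivalents.
DIGIT_HOMONYMS = {"for": "4", "won": "1", "ate": "8", "to": "2", "too": "2"}


def digit_homonym_filter(text):
    """Replace a digit homonym only when both neighboring words are digits."""
    tokens = text.split()
    out = list(tokens)
    for i in range(1, len(tokens) - 1):
        if (tokens[i] in DIGIT_HOMONYMS
                and tokens[i - 1].isdigit()
                and tokens[i + 1].isdigit()):
            out[i] = DIGIT_HOMONYMS[tokens[i]]
    return " ".join(out)


print(digit_homonym_filter("call me back at 3 4 for 5 1 7 8"))
```

Running this sketch on text that has not yet passed through the digit filter leaves the homonym in place, because its neighbors are still spelled-out words, which illustrates why the ordering matters.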

Tag/Engine Filter

When a spoken phrase is transcribed by an ASR engine, certain tags and symbols may be inserted by the engine. A tag filter and/or an engine filter may be used to remove these tags and symbols. For example, the transcribed phrase “i just wanted <s> to thank you </s>” could be converted to “i just wanted to thank you”.
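A tag filter for angle-bracket tags such as those in the example can be sketched with one regular expression. This is an illustrative assumption about the tag syntax only; other engine-inserted symbols would need their own patterns, and the function name is hypothetical.

```python
import re

# Matches opening or closing angle-bracket tags such as <s>, </s>, <sil 0.56>.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def tag_filter(text):
    """Remove angle-bracket tags and collapse the leftover whitespace."""
    return " ".join(TAG.sub(" ", text).split())


print(tag_filter("i just wanted <s> to thank you </s>"))
```

Since this sketch also removes silence-duration tags, any filter that depends on those tags would have to be applied first.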

SMS Filter

An SMS filter can be used to convert transcribed text into a format more commonly used by a person while texting. For example, the spoken phrase “talk to you later” may be converted to “ttyl”. The SMS filter could be used to convert the transcribed phrase “i did not see you at the party and wanted to say thanks for the gift talk to you later” to “i did not c u @ the party and wanted to say thx 4 the gift ttyl”.
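An SMS filter could be approximated by a lookup table of phrases and their shortcuts, applied longest phrase first so that multi-word shortcuts are not broken up by single-word ones. The table below is a small hypothetical sample, not the filter's actual vocabulary.

```python
import re

# Hypothetical sample of phrase-to-shortcut mappings.
SMS_SHORTCUTS = {
    "talk to you later": "ttyl",
    "thanks": "thx",
    "for": "4",
}


def sms_filter(text):
    """Replace known words and phrases with common texting shortcuts."""
    # Longest phrases first, so "talk to you later" wins over shorter entries.
    for phrase in sorted(SMS_SHORTCUTS, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, SMS_SHORTCUTS[phrase], text)
    return text


print(sms_filter("wanted to say thanks for the gift talk to you later"))
```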

Priority Filter

A priority filter can be used to screen a transcription for determination as to a priority level of the utterance underlying the transcription. For example, a priority filter can screen a transcription for the words “hospital” or “emergency”. If one of these words is found, a priority level of a message associated with the transcription can be set and/or an action can be taken. For example, the unfiltered transcription “meet me at the hospital, I broke my leg” may trigger the priority filter and cause it to flag the associated message with a higher priority. In the context of SMS messaging, a loud ring, alarm, or beep may be triggered by an incoming SMS message having a high priority. In an email context, a higher priority email may be flagged as high priority.
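Unlike the formatting filters, a priority filter inspects the transcription and returns a classification rather than altered text. A minimal sketch, assuming just the two trigger words named above and hypothetical priority labels "high" and "normal":

```python
import re

# Trigger words from the description; a deployed list would likely be longer.
PRIORITY_WORDS = {"hospital", "emergency"}


def priority_filter(text):
    """Return "high" if any trigger word appears, otherwise "normal"."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "high" if words & PRIORITY_WORDS else "normal"


print(priority_filter("meet me at the hospital, I broke my leg"))
```

The returned label could then drive a downstream action, such as a distinctive alert for an SMS message or a high-priority flag on an email.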

Screening

More generally, screening filters are known in the context of, forexample, email. Similar screening filters may be applied to screentranscriptions.

Ad Filter

An ad filter can be used to insert ads or clickable and/or voice clickable links. These ads or links are associated with additional content as is described more fully in one or more of the incorporated references, including U.S. patent application Ser. No. 12/197,227. An existing word, phrase, sentence, or syllable can be converted to a clickable link. Each link can display additional information when a user interacts with it via a user interface, such as by popping up additional information when a user mouses over it. Engaging such a link, for example by clicking on it or “voice clicking” it, can effect navigation to a webpage or otherwise provide additional content.

It will be appreciated that the above filters can be used either independently or in combination. It will further be appreciated that when using the above described filters in combination, the order in which the filters are applied may alter the results. For example, because the sentence filter relies on indications of silence contained within tags, it must be applied before the tag filter is applied to remove tags. In at least some embodiments, a user may select, either directly or via user preference settings, which filters will be applied. In at least some embodiments a user may even select in which order the filters will be applied.
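The effect of filter order can be shown with a short Python sketch of a filter chain. The “<sil>” silence tag and the simplified sentence filter are hypothetical stand-ins, since the tag format used by the engine is not specified here.

```python
import re


def sentence_filter(text):
    """Hypothetical: treat a "<sil>" silence tag as a sentence boundary."""
    return re.sub(r"\s*<sil>\s*", ". ", text).strip()


def tag_filter(text):
    """Remove any remaining bracketed tags and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", text).split())


def apply_filters(text, filters):
    """Apply each filter in turn; the order of the list is the order applied."""
    for current_filter in filters:
        text = current_filter(text)
    return text


if __name__ == "__main__":
    raw = "call me <sil> i am late"
    print(apply_filters(raw, [sentence_filter, tag_filter]))
    print(apply_filters(raw, [tag_filter, sentence_filter]))
```

In the second call the tag filter removes the silence tag before the sentence filter can use it, so the sentence boundary is lost.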

The above described filters are software filters. At least some of them represent software algorithms designed to enhance and refine transcribed text, while others utilize finite grammars to refine transcribed text, and still others represent a combination of both. Preferably, each filter comprises a software function or subroutine that may be called.

Commercial Implementation

One commercial implementation of the foregoing principles is the Yap® and Yap9™ service (collectively, “the Yap service”), available from Yap Inc. of Charlotte, N.C. The Yap service includes one or more web applications and a client device application. The Yap web application is a J2EE application built using Java 5. It is designed to be deployed on an application server like IBM WebSphere Application Server or an equivalent J2EE application server. It is designed to be platform neutral, meaning the server hardware and OS can be anything supported by the web application server (e.g. Windows, Linux, MacOS X).

FIG. 12 is a block diagram of the system architecture of the Yap commercial implementation. With reference to FIG. 12, the operating system may be implemented in Red Hat Enterprise Linux 5 (RHEL 5); the application servers may include the WebSphere Application Server Community Edition (WAS-CE) servers, available from IBM; the web server may be an Apache server; the CTTS Servlets may include CTTS servlets from Loquendo, including US/UK/ES male and US/UK/ES female; the Grammar ASP may be the latest WebSphere Voice Server, available from IBM; suitable third party ads may be provided by Google; a suitable third party IM system is Google Talk, available from Google; and a suitable database system is the DB2 Express relational database system, available from IBM.

FIG. 13 is a block diagram of the Yap EAR of FIG. 12. The audio codec JARs may include the VoiceAge AMR JAR, available from VoiceAge of Montreal, Quebec and/or the QCELP JAR, available from Qualcomm of San Diego, Calif.

The Yap web application includes a plurality of servlets. As used herein, the term “servlet” refers to an object that receives a request and generates a response based on the request. Usually, a servlet is a small Java program that runs within a Web server. Servlets receive and respond to requests from Web clients, usually across HTTP and/or HTTPS, the HyperText Transfer Protocol. Currently, the Yap web application includes nine servlets: Correct, Debug, Install, Login, Notify, Ping, Results, Submit, and TTS. Each servlet is described below in the order typically encountered.

The communication protocol used for all messages between the Yap client and Yap server applications is HTTP and HTTPS. Using these standard web protocols allows the Yap web application to fit well in a web application container. From the application server's point of view, it cannot distinguish between the Yap client midlet and a typical web browser. This aspect of the design is intentional: it convinces the web application server that the Yap client midlet is actually a web browser. This allows a user to use features of the J2EE web programming model like session management and HTTPS security. It is also an important feature of the client as the MIDP specification requires that clients are allowed to communicate over HTTP.

More specifically, the Yap client uses the POST method and custom headers to pass values to the server. The body of the HTTP message in most cases is irrelevant, with the exception of when the client submits audio data to the server, in which case the body contains the binary audio data. The server responds with an HTTP code indicating the success or failure of the request and data in the body which corresponds to the request being made. Preferably, the server does not depend on custom header messages being delivered to the client as the carriers can, and usually do, strip out unknown header values. FIG. 14 is a typical header section of an HTTP request from the Yap client.

The Yap client is operated via a user interface (UI), known as “Yap9,” which is well suited for implementing methods of converting an audio message into a text message and messaging in mobile environments. Yap9 is a combined UI for SMS and web services (WS) that makes use of the buttons or keys of the client device by assigning a function to each button (sometimes referred to as a “Yap9” button or key). Execution of such functions is carried out by “Yaplets.” This process, and the usage of such buttons, are described elsewhere herein and, in particular, in FIGS. 9A-9D, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837.

Usage Process—Install: Installation of the Yap client device application is described in the aforementioned U.S. Patent Application Pub. No. US 2007/0239837 in a subsection titled “Install Process” of a section titled “System Architecture.”

Usage Process—Notify: When a Yap client is installed, the install fails, or the install is canceled by the user, the Notify servlet is sent a message by the phone with a short description. This can be used for tracking purposes and to help diagnose any install problems.

Usage Process—Login: When the Yap midlet is opened, the first step is to create a new session by logging into the Yap web application using the Login servlet. Preferably, however, multiple login servers exist, so as a preliminary step, a request is sent to find a server to log in to. Exemplary protocol details for such a request can be seen in FIG. 15. An HTTP string pointing to a selected login server will be returned in response to this request. It will be appreciated that this selection process functions as a poor man's load balancer.

After receiving this response, a login request is sent. Exemplary protocol details for such a request can be seen in FIG. 16. A cookie holding a session ID is returned in response to this request. The session ID is a pointer to a session object on the server which holds the state of the session. This session data will be discarded after a period determined by server policy.

Sessions are typically maintained using client-side cookies; however, a user cannot rely on the set-cookie header successfully returning to the Yap client because the carrier may remove that header from the HTTP response. The solution to this problem is to use the technique of URL rewriting. To do this, the session ID is extracted from the session API and returned to the client in the body of the response. This is called the “Yap Cookie” and is used in every subsequent request from the client. The Yap Cookie looks like this:

-   -   ;jsessionid=C240B217F2351E3C420A599B0878371A

All requests from the client simply append this cookie to the end of each request and the session is maintained:

-   -   /Yap/Submit;jsessionid=C240B217F2351E3C420A599B0878371A
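The client side of this URL rewriting scheme amounts to string concatenation, as the following minimal Python sketch suggests. The helper names are hypothetical; the cookie value is the example shown above.

```python
SESSION_PREFIX = ";jsessionid="


def with_session(path, yap_cookie):
    """Append the Yap Cookie to a request path so the session is maintained."""
    return path + yap_cookie


def extract_session_id(yap_cookie):
    """Return the bare session ID carried by a Yap Cookie string."""
    if not yap_cookie.startswith(SESSION_PREFIX):
        raise ValueError("not a Yap Cookie: " + yap_cookie)
    return yap_cookie[len(SESSION_PREFIX):]


if __name__ == "__main__":
    cookie = ";jsessionid=C240B217F2351E3C420A599B0878371A"
    print(with_session("/Yap/Submit", cookie))
```

Because the session ID travels in the request path rather than in a header, it is not affected when a carrier strips header values.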

Usage Process—Submit: After receiving a session ID, audio data may be submitted. The user presses and holds one of the Yap9 buttons, speaks aloud, and releases the pressed button. The speech is recorded, and the recorded speech is then sent in the body of a request to the Submit servlet, which returns a unique receipt that the client can use later to identify this utterance. Exemplary protocol details for such a request can be seen in FIG. 17.

One of the header values sent to the server during the login process is the format in which the device records. That value is stored in the session so the Submit servlet knows how to convert the audio into a format required by the ASR engine. This is done in a separate thread as the process can take some time to complete.

The Yap9 button and Yap9 screen numbers are passed to the Submit servlet in the HTTP request header. These values are used to look up a user-defined preference of what each button is assigned to. For example, the 1 button may be used to transcribe audio for an SMS message, while the 2 button is designated for a grammar based recognition to be used in a web services location based search. The Submit servlet determines the appropriate “Yaplet” to use. When the engine has finished transcribing the audio or matching it against a grammar, the results are stored in a hash table in the session.
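The preference lookup might be sketched in Python as a table keyed by screen and button number. The Yaplet names, the table layout, and the fallback value are all hypothetical and are labeled as such in the code.

```python
# Hypothetical fallback when a user has not assigned a Yaplet to a button.
DEFAULT_YAPLET = "sms_transcription"

# Hypothetical preferences for one user, keyed by (screen number, button number).
EXAMPLE_PREFERENCES = {
    (1, 1): "sms_transcription",
    (1, 2): "location_search",
}


def choose_yaplet(preferences, screen, button):
    """Look up which Yaplet the user assigned to a button on a given screen."""
    return preferences.get((screen, button), DEFAULT_YAPLET)


if __name__ == "__main__":
    print(choose_yaplet(EXAMPLE_PREFERENCES, 1, 2))
```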

In the case of transcribed audio for an SMS text message, a number of filters can be applied to the text returned from the ASR engine. Such filters may include, but are not limited to, those described hereinabove.

Notably, after all of the filters are applied, both the filtered text and original text are returned to the client so that if text to speech is enabled for the user, the original unfiltered text can be used to generate the TTS audio.

Usage Process—Results: The client retrieves the results of the audio by taking the receipt returned from the Submit servlet and submitting it as a request to the Results servlet. Exemplary protocol details for such a request can be seen in FIG. 18. This is done in a separate thread on the device and a timeout parameter may be specified which will cause the request to return after a certain amount of time if the results are not available. In response to the request, a block of XML is preferably returned. Exemplary protocol details for such a return response can be seen in FIG. 19. Alternatively, a serialized Java Results object may be returned. This object contains a number of getter functions for the client to extract the type of results screen to advance to (i.e., SMS or results list), the text to display, the text to be used for TTS, any advertising text to be displayed, an SMS trailer to append to the SMS message, etc.

Usage Process—TTS: The user may choose to have the results read back via Text to Speech. The user could disable this option to save network bandwidth, but it adds value in situations where looking at the screen is not desirable, such as when driving. If TTS is used, the TTS string is extracted from the results and sent via an HTTP request to the TTS servlet. Exemplary protocol details for such a request can be seen in FIG. 20. The request blocks until the TTS is generated and returns audio in the format supported by the phone in the body of the result. This is performed in a separate thread on the device since the transaction may take some time to complete. The resulting audio is then played to the user through the AudioService object on the client. Preferably, TTS speech from the server is encrypted using Corrected Block Tiny Encryption Algorithm (XXTEA) encryption.

Usage Process—Correct: As a means of tracking accuracy and improving future SMS based language models, if the user makes a correction to transcribed text on the phone via the keypad before sending the message, the corrected text is submitted to the Correct servlet along with the receipt for the request. This information is stored on the server for later use in analyzing accuracy and compiling a database of typical SMS messages. Exemplary protocol details for such a submission can be seen in FIG. 21.

Usage Process—Ping: Typically, web sessions will time out after a certain amount of inactivity. The Ping servlet can be used to send a quick message from the client to keep the session alive. Exemplary protocol details for such a message can be seen in FIG. 22.

Usage Process—Debug: Used mainly for development purposes, the Debug servlet sends logging messages from the client to a debug log on the server. Exemplary protocol details can be seen in FIG. 23.

Usage Process—Logout: To log out from the Yap server, an HTTP logout request needs to be issued to the server. An exemplary such request would take the form: “/Yap/Logout;jsessionid=1234”, where 1234 is the session ID.

User Preferences: In at least one embodiment, the Yap website has a section where the user can log in and customize their Yap client preferences. This allows them to choose from available Yaplets and assign them to Yap9 keys on their phone. The user preferences are stored and maintained on the server and accessible from the Yap web application. This frees the Yap client from having to know about all of the different back-end Yaplets. It just records the audio, submits it to the server along with the Yap9 key and Yap9 screen used for the recording and waits for the results. The server handles all of the details of what the user actually wants to have happen with the audio.

The client needs to know what type of format to utilize when presenting the results to the user. This is accomplished through a code in the Results object. The majority of requests fall into one of two categories: sending an SMS message, or displaying the results of a web services query in a list format. Notably, although these two are the most common, the Yap architecture supports the addition of new formats.

Alternative Contexts and Implementations

It will be appreciated that although one or more embodiments in accordance with the present invention have been described above in the context of SMS messaging and instant messaging, the invention is susceptible of use in a wide variety of contexts and applications. Generally, it is contemplated that filters and finite grammars may be utilized in any context in which an ASR engine is utilized. More specifically, filters and finite grammars can be used in combination with an SLM in a voice mail context, a command context, a customer service context, a contact navigation and input context, and a navigation context. In each of these contexts, transcription and filtering may be performed either locally, or at a remote server (or a plurality of remote servers).

In a voice mail context, a voicemail is stored as recorded audio data, i.e. a recorded utterance. This recorded utterance can be transcribed to text using an SLM. This unfiltered transcription is then filtered using one or more filters as described more fully hereinabove in the context of SMS messaging. Preferably, the unfiltered transcription is filtered using a finite grammar filter. The output of this process is a filtered transcription that can be presented to a user as an SMS message, email, or instant message. It will be appreciated that after being transcribed to text, various additional filters other than those described hereinabove may be utilized. For example, a screening filter may screen out messages that fail to include certain words or phrases selected by the user. Similarly, a priority filter, similar to the one described hereinabove in the context of SMS messaging, may be utilized to prioritize messages including certain words or phrases. For example, transcriptions containing the word “emergency” or “hospital” could be flagged as high priority and an action taken, such as, for example, sending an email to an address of the user.

In a command context, a user may speak an utterance that is heard by a microphone of a user device. The utterance is stored as recorded audio data, and the recorded utterance can then be transcribed to text using an SLM. This unfiltered transcription is then filtered using one or more filters as described more fully hereinabove in the context of SMS messaging. Preferably, the unfiltered transcription is filtered using a finite grammar filter. As described above, this transcription and filtering may be performed at a remote server. In this context, a filter may alter the unfiltered transcription to represent instructions for the user device in computer readable format. These instructions (which represent a filtered transcription) may then be transmitted back to the user device to be acted on by the user device.

In a customer service context, a user speaks an utterance that is recorded as audio data. Preferably this user speaks an utterance into a standard telephone that is received by a remote server. This recorded utterance can then be transcribed to text using an SLM, either at the same remote server or at a different remote server. The use of ASR engines in a customer service context is well known. Unlike in conventional use, however, the SLM transcription is filtered using one or more filters as described more fully hereinabove in the context of SMS messaging. Preferably, the unfiltered transcription is filtered using a finite grammar filter.

In a contact navigation and input context, a user may speak an utterance that is heard by a microphone of a user device. The utterance is stored as recorded audio data, and the recorded utterance can then be transcribed to text using an SLM. This unfiltered transcription is then filtered using one or more filters as described more fully hereinabove in the context of SMS messaging. Preferably, the unfiltered transcription is filtered using a finite grammar filter. As described above, this transcription and filtering may be performed at a remote server. In this event, the filtered transcription is transmitted back to the user device, which device may then perform an action based upon the filtered transcription. For example, a user may utter “Add Bob to my Contacts, seven zero four five five five three three zero zero.” This utterance may be transcribed by an SLM, either locally or remotely, to “add bob to my contacts seven zero for five five five three three zero zero”. This unfiltered transcription may then be filtered to machine readable instructions to create a new contact named Bob with the specified phone number. For example, one or more filters may be applied to output the filtered transcription: “contacts.add(‘Bob, 7045553300’)”. The user device may then act on this filtered transcription to add a new contact.
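The contact example can be traced with a minimal Python sketch. It assumes a single fixed command pattern and treats every digit homonym in the number portion as a digit, which is a simplification of the separate digit and digit homonym filters; the function name and tables are hypothetical.

```python
import re

# Hypothetical tables for the number portion of the command.
DIGIT_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
DIGIT_HOMONYMS = {"for": "4", "won": "1", "ate": "8", "to": "2", "too": "2"}


def contact_filter(transcription):
    """Turn "add <name> to my contacts <digits>" into an instruction string.

    Returns None when the transcription does not match the assumed pattern.
    """
    match = re.fullmatch(r"add (\w+) to my contacts (.+)", transcription)
    if match is None:
        return None
    name, spoken_number = match.groups()
    digits = []
    for word in spoken_number.split():
        digit = DIGIT_WORDS.get(word) or DIGIT_HOMONYMS.get(word)
        if digit is None:
            return None  # unexpected word in the number portion
        digits.append(digit)
    return "contacts.add('{}, {}')".format(name.capitalize(), "".join(digits))


if __name__ == "__main__":
    print(contact_filter(
        "add bob to my contacts seven zero for five five five three three zero zero"))
```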

In a navigation context, a user may speak an utterance that is heard by a microphone of a user device. The utterance is stored as recorded audio data, and the recorded utterance can then be transcribed to text using an SLM. This unfiltered transcription is then filtered using one or more filters as described more fully hereinabove in the context of SMS messaging. Preferably, the unfiltered transcription is filtered using a finite grammar filter. As described above, this transcription and filtering may be performed either locally or at a remote server. In this context, a filter may alter the unfiltered transcription to represent instructions for the user device in computer readable format. These instructions (which represent a filtered transcription) may then be transmitted back to the user device to be acted on by the user device.

It will be appreciated that language varies widely among different cultures, demographics, and geographic locales. Various filters and finite grammars may be selectively utilized, or not, depending on these, and other, factors. For example, if a user is associated with the United States, either through his or her user preferences or a GPS determination (as described hereinabove), or otherwise, then the word “period” may be abbreviated “.” by an SMS filter. If a user is associated with the United Kingdom, however, then the phrase “full stop” may be abbreviated “.” by an SMS filter. Further, it is contemplated that when transmitting messages from one user to another across locales, one or more filters may alter the message based on these locales. For example, a user in North Carolina may utter “I want a soda” and indicate that the phrase is to be sent to a second user in Michigan. The utterance may be stored as recorded audio data, and then transcribed in a backend server to “i want a soda”. A locale filter may then be applied that would replace the word “soda”, which is widely used in North Carolina, with the word “pop”, which is widely used in Michigan. Applying this locale filter to the unfiltered transcription “i want a soda” would produce the filtered transcription “i want a pop”. Preferably, one or more finite grammar filters are applied as well.
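A locale filter of the kind described could be sketched in Python as a lookup keyed by the sender and recipient locales. The table layout, locale labels, and function name are assumptions; only the “soda” and “pop” pair comes from the example above.

```python
# Hypothetical table: (sender locale, recipient locale) -> word replacements.
LOCALE_TERMS = {
    ("north carolina", "michigan"): {"soda": "pop"},
}


def locale_filter(text, sender_locale, recipient_locale, table=LOCALE_TERMS):
    """Replace sender-locale words with their recipient-locale equivalents."""
    replacements = table.get((sender_locale, recipient_locale), {})
    return " ".join(replacements.get(word, word) for word in text.split())


if __name__ == "__main__":
    print(locale_filter("i want a soda", "north carolina", "michigan"))
```

When no entry exists for a pair of locales, the sketch leaves the transcription unchanged.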

Based on the foregoing description, it will be readily understood by those persons skilled in the art that the present invention is susceptible of broad utility and application. Many embodiments and adaptations of the present invention other than those specifically described herein, as well as many variations, modifications, and equivalent arrangements, will be apparent from or reasonably suggested by the present invention and the foregoing descriptions thereof, without departing from the substance or scope of the present invention.

Accordingly, while the present invention has been described herein in detail in relation to one or more preferred embodiments, it is to be understood that this disclosure is only illustrative and exemplary of the present invention and is made merely for the purpose of providing a full and enabling disclosure of the invention. The foregoing disclosure is not intended to be construed to limit the present invention or otherwise exclude any such other embodiments, adaptations, variations, modifications or equivalent arrangements, the present invention being limited only by the claims appended hereto and the equivalents thereof.

1. A method for facilitating mobile device messaging, comprising the steps of: (a) receiving audio data communicated from the mobile communication device, the audio data representing an utterance that is intended to be at least a portion of the text of the message that is to be sent from the mobile communication device to a recipient; (b) transcribing the utterance to text based on the received audio data to generate a transcription; (c) applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; and (d) communicating the filtered transcription to a mobile communication device.
 2. The method of claim 1, wherein the mobile communication device to which the filtered transcription is communicated is the mobile communication device from which the audio data is received.
 3. The method of claim 1, wherein the mobile communication device to which the filtered transcription is communicated is a mobile communication device of the recipient of the message.
 4. The method of claim 1, wherein the audio data is communicated from the mobile communication device using the HTTP/HTTPS protocol.
 5. The method of claim 1, wherein the audio data is communicated from the mobile communication device over the Internet using the HTTP/HTTPS protocol.
 6. The method of claim 1, wherein the utterance is transcribed using a statistical language model.
 7. The method of claim 1, wherein a filter includes a list of predetermined words, including phrases and alphanumeric strings, each predetermined word being associated with another predetermined word, including a predetermined phrase or a predetermined alphanumeric string; and wherein the step of applying a filter to the transcribed text comprises comparing words, including phrases and alphanumeric strings, from the transcribed text to the list of words of the filter and, upon a match, replacing the matching word, including a phrase or alphanumeric string, with the associated, predetermined word including a predetermined phrase or a predetermined alphanumeric string.
 8. The method of claim 1, wherein the filter that is applied comprises a finite grammar.
 9. The method of claim 1, wherein the filter that is applied comprises a software filter.
 10. The method of claim 1, further comprising the step of selecting one or more filters to apply to the transcribed text from a group of filters that may be applied to the transcribed text to generate the filtered transcription.
 11. The method of claim 10, wherein the selection of the one or more filters to apply is made based on an indication that is received in conjunction with the recorded audio data received from the mobile communication device.
 12. The method of claim 11, wherein the indication is included within a header of the communication from the mobile communication device in which the audio data is received.
 13. The method of claim 10, wherein the selection of the one or more filters to apply is made based on preferences of a user of a mobile communication device.
 14. The method of claim 13, wherein the mobile communication device of the user is the mobile communication device from which the audio data is received.
 15. The method of claim 10, wherein the group of filters comprises an ad filter, a caller name filter, a caller number filter, a closing filter, a contraction filter, a currency filter, a date filter, a digit filter, a digit format filter, a digit homonym filter, an engine filter, a greeting filter, a hyphenate filter, a number filter, a profanity filter, an ordinal filter, a proper noun filter, a punctuation filter, a sentence filter, a shout/scream filter, an SMS filter, a tag filter, and a time filter.
 16. The method of claim 1, wherein said step of applying a filter to the transcribed text to generate a filtered transcription comprises applying an ad filter, whereby advertisement is inserted into the transcribed text based on, and in association with, predetermined keywords that are identified in the transcribed text.
 17. The method of claim 1, wherein the mobile communication device comprises a mobile phone.
 18. A method for facilitating mobile device messaging, comprising the steps of: (a) receiving from a mobile communication device, (i) a destination address for sending a message to a recipient, and (ii) audio data representing an utterance that represents the text of the message that is to be sent to the recipient; (b) transcribing the utterance to text based on the received audio data to generate a transcription; (c) applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; and (d) communicating to the recipient the filtered transcription as the text of the message.
 19. A method for facilitating mobile device messaging, comprising the steps of: (a) receiving from a mobile communication device, (i) a destination address for sending a message to a recipient, and (ii) audio data representing an utterance that represents the text of the message that is to be sent to the recipient; (b) transcribing the utterance to text based on the received audio data to generate a transcription; (c) applying a filter to the transcribed text to generate a filtered transcription, the text of which is intended to mimic language patterns of mobile device messaging that is performed manually by users; (d) communicating the filtered transcription to the mobile communication device; (e) presenting the filtered transcription by the mobile communication device for verifying; and (f) sending to the recipient from the mobile communication device the filtered transcription as the text of the message.
 20. The method of claim 18, further comprising revising the filtered transcription presented by the mobile communication device for verifying, wherein the filtered transcription that is sent as the text of the message is a revised, filtered transcription.
 21-50. (canceled)